ABSTRACT

Distributed computing excels at processing large-scale data, but the communication cost of synchronizing shared parameters may slow down overall performance. Fortunately, the interactions between parameters and data in many problems are sparse, which admits an efficient partitioning that reduces the communication overhead.

In this paper, we formulate data placement as a graph partitioning problem and propose a distributed partitioning algorithm for it. We give theoretical guarantees as well as a highly efficient implementation, and demonstrate promising results on both text datasets and social networks. We show that the proposed algorithm leads to a 1.6x speedup of a state-of-the-art distributed machine learning system by eliminating 90% of the network communication.

1. INTRODUCTION 

The importance of large-scale machine learning continues to grow in concert with the big data boom, the advances in learning techniques, and the deployment of systems that enable wider applications. As the amount of data scales up, the need to harness increasingly large clusters of machines increases significantly. In this paper, we address a question that is fundamental for applying today’s loosely-coupled “scale-out” cluster computing techniques to important classes of machine learning applications:

How to spread data and model parameters across a cluster of machines for efficient processing?

One big challenge for large-scale data processing problems is to distribute the data over processing nodes to fit the computation and storage capacity of each node. For instance, for very large scale graph factorization [2], one needs to partition a natural graph in such a way that the memory required for storing the local state of the partition and caching the adjacent variables stays within the capacity of each machine. Similar constraints apply to GraphLab [20, 12], where vertex-specific updates are carried out while keeping other variables synchronized between machines. Likewise, in distributed inference for graphical models with latent variables [1, 26], the distributed state variables must be synchronized efficiently between machines. Furthermore, general-purpose distributed machine learning frameworks such as the parameter server [18, 8] face similar issues when it comes to data and parameter layout.

Figure 1: The amount of (outgoing) network traffic versus the size of data in a real text dataset. The algorithm uses 16 machines to run 100 iterations. The first-order gradients are communicated.

Shared parameters are synchronized via the communication network. The sheer number of parameters and the iterative nature of machine learning algorithms often produce huge amounts of network traffic. Figure 1 shows that, if we randomly assign data (documents) to machines in a text classification application, the total amount of network traffic is 100 times larger than the size of training data. Specifically, almost 4 TB of parameters are communicated for 300 GB of training data. Given that the network bandwidth is typically much smaller than the local memory bandwidth, this traffic volume can become a performance bottleneck.

There are three key challenges in achieving scalability for 
large-scale data processing problems: 

Limited computation (CPU) per machine: we therefore need a well-balanced distribution of tasks over machines.

Limited memory (RAM) per machine: the amount of storage per machine available for processing and caching model variables is often constrained to a small fraction of the total model.

Limited network bandwidth: the network bandwidth is typically 100 times smaller than the local memory bandwidth. Thus we need to reduce the amount of communication between machines.

One key observation is the sparsity pattern in large-scale datasets: most documents contain only a small fraction of distinct words, and most people have only a few friends in a social graph. Such nonuniformity and sparsity are both a boon and a challenge for the problem of dataset partitioning. Even though dataset partitioning problems are often NP-hard [28], their practical importance makes it worth seeking practical solutions that outperform random partitioning, which typically leads to poor performance.

Our contributions: In this paper, we formulate the task of data and parameter placement as a graph partitioning problem. We propose Parsa, a PARallel Submodular Approximation algorithm for solving this problem, and we analyze its theoretical guarantees. A straightforward implementation of the algorithm has running time on the order of $O(k|E|^2)$, where $k$ is the number of partitions and $|E|$ is the number of edges in the graph. Using an efficient vertex selection data structure, we provide an efficient implementation with time complexity $O(k|E|)$. We also discuss techniques including sampling, initialization and parallelization to improve the partitioning quality and efficiency.

Experiments on text datasets and social networks of various scales show that, on both partition quality and time efficiency, Parsa outperforms state-of-the-art methods, including METIS [16], PaToH [6] and Zoltan [9]. Parsa can also significantly accelerate the parameter server, a state-of-the-art general-purpose distributed machine learning system, on hundreds of GBs of data with billions of parameters.

2. GRAPH PARTITIONING 

In this section, we first introduce the inference problem and the model of dependencies in distributed inference. We then formulate the data partitioning problem in distributed inference and close with a brief overview of related work.

2.1 Inference in Machine Learning 

In machine learning, many inference problems have graph-structured dependencies. For instance, in risk minimization [14], we strive to solve

$$\operatorname*{minimize}_{w} \; R[w] := \sum_{i=1}^{m} l(x_i, y_i, w) + \Omega[w], \qquad (1)$$

where $l(x_i, y_i, w)$ is a loss function measuring the model fitting error on the data $(x_i, y_i)$, and $\Omega[w]$ is a regularizer on the model parameter $w$. The data and parameters are often correlated only via the nonzero terms in $x_i$, which exhibit sparsity patterns in many applications. For example, in email spam filtering, elements of $x_i$ correspond to words and attributes in emails, while in computational advertising, they correspond to words in ads and user behavior patterns.

For undirected graphical models [4, 17], the joint distribution of the random variables in log scale can be written as a summation of the potentials of all the cliques in the graph, and each clique potential $\psi_C(w_C)$ depends only on the subset of variables $w_C$ in the clique $C$.

Figure 2: Modeling dependencies as a bipartite graph.

The learning and inference problems in undirected graphical models are often formulated as an optimization problem of the following form:

$$\operatorname*{minimize}_{w} \; R[w] := \sum_{C \in \mathcal{C}} \psi_C(w_C), \qquad (2)$$

where local variables interact through the model parameters $w_C$ of the cliques.

Similar problems occur in the context of inference on natural graphs [3, 12, 2], where we have sets of interacting parameters represented by vertices on the graph, and manipulating a vertex affects all of its neighbors computationally.

2.2 Bipartite Graphs 

The dependencies in the inference problems above can be modeled by a bipartite graph $G(U, V, E)$ with vertex sets $U$ and $V$ and edge set $E$. We denote the edge between two nodes $u \in U$ and $v \in V$ by $(u, v) \in E$. Figure 2 illustrates the case of risk minimization (1), where $U$ consists of the samples $\{(x_i, y_i)\}_{i=1}^{m}$ and $V$ consists of the parameters in $w$. There is an edge $((x_i, y_i), w_j)$ if and only if the $j$-th element of $x_i$ is non-zero. Therefore, $\{w_j : ((x_i, y_i), w_j) \in E\}$ is the working set of elements of $w$ for evaluating the loss function $l(x_i, y_i, w)$ on the sample $(x_i, y_i)$.
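
As a minimal sketch of this construction, the following Python snippet derives both sides of the bipartite graph from sparse samples. The helper name `build_bipartite` and the dict-based sparse encoding of each sample are illustrative assumptions, not part of the original system.

```python
from collections import defaultdict


def build_bipartite(samples):
    """Build the neighbor sets of the bipartite graph from sparse samples.

    `samples` is a list of dicts mapping a feature index j to its value,
    a hypothetical sparse encoding of the x_i. Returns (N_u, N_v), where
    N_u[i] is the working set of sample i and N_v[j] is the set of
    samples that touch parameter j.
    """
    N_u = []
    N_v = defaultdict(set)
    for i, x in enumerate(samples):
        # Only nonzero entries of x_i create an edge to parameter w_j.
        working_set = {j for j, value in x.items() if value != 0}
        N_u.append(working_set)
        for j in working_set:
            N_v[j].add(i)
    return N_u, dict(N_v)


samples = [{0: 0.5, 2: 1.0}, {1: 0.3}, {1: 0.4, 2: 0.3}, {1: 0.9, 3: 0.0}]
N_u, N_v = build_bipartite(samples)
print(N_u)
print(N_v)
```

Explicitly stored zeros (such as feature 3 of the last sample) create no edge, so the working set of a sample contains exactly its nonzero features.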

We can construct such a bipartite graph $G(U', V, E')$ to encode the dependencies in undirected graphical models and natural graphs with node set $V$ and edge set $E$. One construction is to define $U' = V$, and add an edge $(u, v)$ to the edge set $E'$ if $u$ and $v$ are connected in the original graph. An alternative construction is to define the node set $U'$ to be $\mathcal{C}$, the set of all cliques of the original graph, and add an edge $(C, v)$ to the edge set $E'$ if node $v$ belongs to the clique $C$ in the original graph.

Throughout the discussion, we refer to U as the set of data 
(examples) nodes and V as the set of parameters (results) 
nodes. 

2.3 Distributed Inference 

The challenge for large-scale inference is that the optimization problem in (2) is too large, and even the model $w$ may be too large to be stored on a single machine. One solution is to exploit the additive form of $R[w]$ to decompose the optimization into smaller problems, and then employ multiple machines to solve these sub-problems while keeping the solutions (parameters) consistent.

There exist several frameworks that simplify the development of efficient distributed algorithms, such as Hadoop [11] and its in-memory variant Spark [34] for executing MapReduce programs, and GraphLab for distributed graph computation [21]. In this paper, we focus on the parameter server framework [18], a high-performance general-purpose distributed machine learning framework.

Figure 3: Simplified parameter server architecture, with server nodes and worker nodes.

Figure 4: Each machine contains a worker and a server, holding a part of $U$ and $V$, respectively. The inter-machine dependencies (edges) are highlighted and the communication costs for these three machines are 2, 3, and 3, respectively. Moving the 3rd vertex in $V$ to either machine 0 or 1 reduces cost.

In the parameter server framework, computational nodes are divided into server nodes and worker nodes, as shown in Figure 3. The globally shared parameters $w$ are partitioned and stored in the server nodes. Each worker node solves a sub-problem and communicates with the server nodes in two ways: it pushes local results such as gradients or parameter updates to the servers, and it pulls recent parameters (or their changes) from the servers. Both push and pull are executed asynchronously.

2.4 Multiple Objectives of Partitioning 

In distributed inference, we divide the problem in (2) by partitioning the cost function $R[w]$ as well as the associated dependency graph into $k$ blocks. Without loss of generality we consider a parameter server with $k$ server nodes and $k$ worker nodes, where each machine has exactly one server and one worker (otherwise we can aggregate multiple nodes on the same machine without affecting the following analysis). For the bipartite dependency graph $G(U, V, E)$, we partition the parameter set $V$ into $k$ parts and assign each part to a server node, and we partition the data set $U$ into $k$ parts and assign them to individual worker nodes. Figure 4 illustrates an example for $k = 3$. More specifically, we want to divide both $U$ and $V$ into $k$ non-overlapping parts

$$U = \bigcup_{i=1}^{k} U_i \quad \text{and} \quad V = \bigcup_{i=1}^{k} V_i, \qquad (3)$$

and assign the parts $U_i$ and $V_i$ to the worker node and server node on machine $i$, respectively.

There are three goals when implementing the graph partitioning:

Balancing the computational load. We want to ensure that each machine has approximately the same computational load. Assuming that each example $u_i$ incurs roughly the same workload, one objective is to keep $\max_i |U_i|$ small:

$$\operatorname*{minimize} \; \max_i |U_i| \qquad (4)$$

Satisfying the memory constraint. Inference algorithms frequently access the parameters (at random). Workers keep these parameters in memory to improve performance, yet RAM is limited. Denote by $\mathcal{N}(u_i)$ the neighbor set of $u_i$:

$$\mathcal{N}(u_i) = \{v_j : (u_i, v_j) \in E\}. \qquad (5)$$

Then $\bigcup_{u \in U_i} \mathcal{N}(u)$ is the working set of parameters that worker $i$ needs. For simplicity we assume that each parameter $v_j$ has the same storage cost. Our goal of limiting the workers’ memory footprint is then given by

$$\operatorname*{minimize} \; \max_i |\mathcal{N}(U_i)| \quad \text{where} \quad \mathcal{N}(U_i) := \bigcup_{u \in U_i} \mathcal{N}(u). \qquad (6)$$

Minimizing the communication cost. The total communication cost per worker $i$ is $|\mathcal{N}(U_i)|$, which is already minimized by our previous goal (6). To further reduce this cost, we can place server $i$ on the same machine as worker $i$, so that communication between them uses memory rather than the network. This reduces the inter-machine communication cost to $|\mathcal{N}(U_i)| - |\mathcal{N}(U_i) \cap V_i|$. Figure 4 shows an example. Further note that if $v_j$ is not needed by worker $i$, then server $i$ should never maintain $v_j$. In other words, we have $V_i \subseteq \mathcal{N}(U_i)$ and the cost simplifies to $|\mathcal{N}(U_i)| - |V_i|$.

On the other hand, $\sum_{j \neq i} |V_i \cap \mathcal{N}(U_j)|$ is the communication cost of server $i$, because other workers must request parameters from server $i$. Therefore, the goal of minimizing the maximal communication cost of a machine is

$$\operatorname*{minimize} \; \max_i \; |\mathcal{N}(U_i)| - |V_i| + \sum_{j \neq i} |V_i \cap \mathcal{N}(U_j)|. \qquad (7)$$
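
To make the three objectives concrete, the sketch below evaluates (4), (6) and (7) for a given partition of a toy graph. The function names and the example data are assumptions chosen for illustration; the worker cost is written as $|\mathcal{N}(U_i)| - |\mathcal{N}(U_i) \cap V_i|$ so that it also covers partitions in which $V_i$ is not a subset of $\mathcal{N}(U_i)$.

```python
def neighbor_set(U_i, N_u):
    """N(U_i): the union of the neighbor sets of all examples in U_i."""
    result = set()
    for u in U_i:
        result |= N_u[u]
    return result


def objectives(U_parts, V_parts, N_u):
    """Return (max load, max memory footprint, max communication cost)."""
    k = len(U_parts)
    N = [neighbor_set(U_i, N_u) for U_i in U_parts]
    load = max(len(U_i) for U_i in U_parts)      # objective (4)
    memory = max(len(N_i) for N_i in N)          # objective (6)
    comm = []
    for i in range(k):
        # Parameters worker i needs that are not stored on its own machine.
        worker_cost = len(N[i]) - len(N[i] & V_parts[i])
        # Parameters of server i that workers on other machines request.
        server_cost = sum(len(V_parts[i] & N[j]) for j in range(k) if j != i)
        comm.append(worker_cost + server_cost)
    return load, memory, max(comm)               # objective (7)


# Toy graph: example u touches the parameters in N_u[u].
N_u = [{0, 1}, {1}, {1, 2}, {2, 3}]
U_parts = [{0, 1}, {2, 3}]
V_parts = [{0, 1}, {2, 3}]
print(objectives(U_parts, V_parts, N_u))
```

In this toy instance the only inter-machine dependency is parameter 1, which is stored on machine 0 but also needed by the worker on machine 1, so both machines incur a communication cost of 1.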

2.5 Related Work 

Graph partitioning has attracted much interest in scientific computing [16, 6, 9], scaling out large-scale computations [12, 35, 33, 5, 13, 31], graph databases [25, 32, 7], search and social network analysis [24, 31, 2], and streaming processing [28, 27, 22, 30].

Most previous work, such as METIS [16], is concerned with edge cuts. Only a few approaches solve the vertex cut problem, which is closely related to this paper, to directly minimize the network traffic. PaToH [6] and Zoltan [9] used multilevel partitioning algorithms related to METIS, while PowerGraph [12] adopted a greedy algorithm. Very recently, [5] studied the relation between edge cut and vertex cut.

In contrast to these works, we propose a new algorithm based on submodular approximation to solve the vertex-cut partitioning problem. We give a theoretical analysis of the partition quality, and describe an efficient distributed implementation. We show that the proposed algorithm outperforms the state of the art on several large-scale real datasets in terms of both quality and speed.



















































Algorithm 1 Partition U via submodular approximation
Input: Graph $G$, #partitions $k$, maximal #iterations $n$, residue $\theta$, and improvement $\alpha$
Output: Partitions of $U = \bigcup_{i=1}^{k} U_i$
 1: for $i = 1, \ldots, k$ do
 2:   $U_i \leftarrow \emptyset$
 3:   define $g_i(T) := f(T \cup U_i) - \alpha |T \cup U_i|$
 4: end for
 5: for $t = 1, \ldots, n$ do
 6:   if $|U| < k\theta$ then break
 7:   find $i \leftarrow \operatorname{argmin}_j |U_j|$
 8:   draw $R \subseteq U$ by choosing $u \in U$ with probability [...]
 9:   if $|R| > 2n/k$ then next
10:   solve $T^* = \operatorname{argmin}_{T \subseteq R} g_i(T)$
11:   if $g_i(T^*) < 0$ then $U_i \leftarrow U_i \cup T^*$ and $U \leftarrow U \setminus T^*$
12: end for
13: if $|U| > k\theta$ then return fail
14: evenly assign the remainder $U$ to the $U_i$


3. ALGORITHM 

In this section, we present our algorithm Parsa for solving the partitioning problem with the multiple objectives in (4), (6) and (7).

Note that (6) is equivalent to a $k$-way graph partitioning problem on the vertex set $U$ with vertex cut as the merit. This problem is NP-complete [6]. Furthermore, (7) is more complex because of the involvement of $V$. Rather than solving all these objectives together, Parsa decomposes the problem into two tasks: partition the data $U$ by solving (4) and (6), and, given the partition of $U$, partition the parameters $V$ by solving (7). Intuitively, we first assign data to workers to balance the CPU load and minimize the memory footprint, and then distribute the parameters over servers to minimize inter-machine communication.

3.1 Partitioning U over Worker Nodes

Note that $f(U) := |\mathcal{N}(U)|$ is a set function in the variable $U$. It is submodular, a property that plays a role for set functions similar to that of convexity and concavity for functions of real variables. Although the problem in (6) is NP-complete, there exist several algorithms that solve it approximately by exploiting submodularity [29]. In our algorithm, we modify [29] to solve (4) and (6). The key difference is that we build up the sets $U_i$ incrementally, which is important for both partition quality and computational efficiency at a later stage.
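
The submodularity of $f(U) = |\mathcal{N}(U)|$ can be checked numerically on a small instance. The following brute-force sketch verifies the diminishing-returns property, i.e. that adding an example to a smaller set never gains less than adding it to a superset; the toy graph and function names are illustrative assumptions, and the exhaustive enumeration is only feasible for tiny ground sets.

```python
from itertools import combinations


def f(U, N_u):
    """f(U) = |N(U)|, the number of distinct parameters touched by U."""
    covered = set()
    for u in U:
        covered |= N_u[u]
    return len(covered)


def is_submodular(N_u):
    """Check f(A + u) - f(A) >= f(B + u) - f(B) for all A <= B, u not in B."""
    ground = list(N_u)
    subsets = [set(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for u in ground:
                if u in B:
                    continue
                gain_A = f(A | {u}, N_u) - f(A, N_u)
                gain_B = f(B | {u}, N_u) - f(B, N_u)
                if gain_A < gain_B:
                    return False
    return True


N_u = {"u1": {1, 2}, "u2": {2, 3}, "u3": {3}, "u4": {1, 4}}
print(is_submodular(N_u))
```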

As shown in Algorithm 1, the algorithm proceeds as follows: in each round we pick the smallest partition $U_i$ and find the best set of elements to add to it. To do so, we first draw a small subset of candidates $R$ and select the best subset using a minimum-increment weight via $\min_{T \subseteq R} f(U_i \cup T) - \alpha |U_i \cup T|$. If the optimal solution $T^*$ satisfies $f(U_i \cup T^*) < \alpha |U_i \cup T^*|$, i.e., the cost for increasing $U_i$ is not too large, we assign $T^*$ to partition $U_i$.

Before showing the implementation details in Section 4, 
we first analyze the partitioning quality of Algorithm 1. 

Proposition 1 Assume that there exists some partitioning $U_i^*$ that satisfies $\max_i f(U_i^*) \le B$. Let $k > \theta = \sqrt{n / \log n}$, $c = (32\pi)^{-1/2}$, $\alpha = Bk / \sqrt{n \log n}$ and $\tau = [\ldots] \log [\ldots]$. Then Algorithm 1 will succeed with probability at least $p$ and it will generate a feasible solution with partitioning cost at most

$$\max_i f(U_i) \le 4B \sqrt{n / \log n}.$$

Proof. The proof is near-identical to that of [29]. Note that we overload the meaning of $U$, as it refers to the remaining variables in the algorithm.

For a given iteration, without loss of generality we assume that $U_i^*$ maximizes $|U_j^* \cap U|$ over all $j$. Denote this by $\nu = |U_i^* \cap U|$. Since $T^*$ is the optimal solution at the current iteration we have $g_i(T^*) \le g_i(U_i^* \cap U) = f((U_i^* \cap U) \cup U_i) - \alpha |(U_i^* \cap U) \cup U_i|$. Further note that by monotonicity and submodularity $f((U_i^* \cap U) \cup U_i) \le f(U_i^* \cap U) + f(U_i)$. Moreover, $U \cap U_i = \emptyset$ holds since $U$ contains only the leftovers. Consequently $|(U_i^* \cap U) \cup U_i| = |U_i^* \cap U| + |U_i|$. Finally, the algorithm only increases the size of $U_i$ whenever the cost is balanced. Hence $f(U_i) - \alpha |U_i| \le 0$. Combining this yields

$$g_i(T^*) \le g_i(U_i^* \cap U) \le f(U_i^* \cap U) - \alpha |U_i^* \cap U| \le B - \alpha \nu.$$

Using the results from the proof of [29, Theorem 5.4] we know that [...] and therefore $g_i(T^*) < 0$ happens with probability at least [...]. Hence the probability of removing at least one vertex from $U$ within an iteration is greater than [...].

Chernoff bounds show that after $\tau = -(n^{[\ldots]} / c) \log(1 - [\ldots])$ iterations the algorithm will terminate with probability at least $p$ since the residual $U$ is small, i.e., $|U| < k\theta$.

The algorithm will never select a $U_i$ for augmentation unless $|U_i| \le n/k$ (there would always be a smaller set). Moreover, the maximum increment at any given time is $2n/k$. Hence $|U_i| \le 3n/k$ and therefore $f(U_i) \le 3n\alpha/k$.

Finally, the contribution of the unassigned residual $U$ is at most $\theta B$ since each $U_i$ is incremented by at most $\theta$ elements and since $f(u) \le B$ for all $u \in U$. In summary, this yields $f(U_i) \le 3n\alpha/k + B\theta = 4B\sqrt{n / \log n}$. □

Algorithm 2 Partition V for given $\{U_i\}_{i=1}^{k}$
Input: The neighbor sets $\{\mathcal{N}(U_i)\}_{i=1}^{k}$
Output: Partitions $V = \bigcup_{i=1}^{k} V_i$
 1: for $i = 1, \ldots, k$ do
 2:   $V_i \leftarrow \emptyset$
 3:   $\text{cost}_i \leftarrow |\mathcal{N}(U_i)|$
 4: end for
 5: for all $j \in V$ do
 6:   $\xi \leftarrow \operatorname{argmin}_{i : u_{ij} \neq 0} \text{cost}_i$
 7:   $V_\xi \leftarrow V_\xi \cup \{j\}$
 8:   $\text{cost}_\xi \leftarrow \text{cost}_\xi - 1 + \sum_{l \neq \xi} u_{lj}$
 9: end for


3.2 Partitioning V over Server Nodes

Next, given the partition of $U$, we find an assignment of the parameters in $V$ to servers. We reformulate (7) as a convex integer programming problem with totally unimodular constraints [15], which is then solved using a sequential optimization algorithm that performs a sweep through the variables.

We define index variables $v_{ij} \in \{0, 1\}$, $i = 1, \ldots, k$, to indicate whether server node $i$ maintains a particular parameter $v_j$. They need to satisfy $\sum_{i} v_{ij} = 1$. Moreover, denote by $u_{ij} \in \{0, 1\}$ variables that record whether $j \in \mathcal{N}(U_i)$. Then we can rewrite (7) as a convex integer program:

$$\operatorname*{minimize} \; \max_i \; |\mathcal{N}(U_i)| + \sum_{j} v_{ij} \Big[ -1 + \sum_{l \neq i} u_{lj} \Big] \qquad (8a)$$

$$\text{subject to} \; \sum_{i} v_{ij} = 1 \; \text{and} \; v_{ij} \in \{0, 1\} \; \text{and} \; v_{ij} \le u_{ij}. \qquad (8b)$$

Here we exploited the fact that $\sum_j v_{ij} u_{lj} = |V_i \cap \mathcal{N}(U_l)|$ and that $\sum_j v_{ij} = |V_i|$. These constraints are totally unimodular, since they satisfy the conditions of [15]. As a consequence every vertex solution is integral and we may relax the condition $v_{ij} \in \{0, 1\}$ to $v_{ij} \in [0, 1]$ to obtain a convex optimization problem.

Algorithm 3 Partitioning U efficiently
Input: Graph $G(U, V, E)$, #partitions $k$, and initial neighbor sets $\{S_i\}_{i=1}^{k}$
Output: Partitioned $U = \bigcup_{i=1}^{k} U_i$ and updated neighbor sets $\{S_i\}_{i=1}^{k}$, which are equal to $\{S_i \cup \mathcal{N}(U_i)\}_{i=1}^{k}$
 1: for $i = 1, \ldots, k$ do
 2:   $U_i \leftarrow \emptyset$
 3:   for all $u \in U$ do $A_i(u) = |\mathcal{N}(u) \setminus S_i|$
 4: end for
 5: while $|U| > 0$ do
 6:   pick partition $i \leftarrow \operatorname{argmin}_j |S_j|$
 7:   pick the lowest-cost vertex $u^* \leftarrow A_i.\text{min}$
 8:   assign $u^*$ to partition $i$: $U_i \leftarrow U_i \cup \{u^*\}$
 9:   remove $u^*$ from $U$: $U \leftarrow U \setminus \{u^*\}$
10:   for $j = 1, \ldots, k$ do remove $u^*$ from $A_j$
11:   for $v \in \mathcal{N}(u^*) \setminus S_i$ do
12:     $S_i \leftarrow S_i \cup \{v\}$
13:     for $u \in \mathcal{N}(v) \cap U$ do $A_i(u) \leftarrow A_i(u) - 1$
14:   end for
15: end while

Algorithm 2 performs a single sweep over (8) to find a locally optimal assignment of one variable at a time. We found that this is sufficient for a near-optimal solution. Repeated sweeps over the assignment space are straightforward and will improve the objective until convergence to optimality in a finite number of steps: due to convexity all local optima are global. Further note that we need not store the full neighbor sets in memory. Instead, we can perform the assignment in a streaming fashion.
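
The single sweep of Algorithm 2 can be sketched in a few lines of Python. This is an illustrative reading rather than the reference implementation: the function name is hypothetical, the cost update assumes that assigning parameter $j$ to server $\xi$ saves one transfer for the local worker and adds one request for every other worker that needs $j$, and parameters that no worker needs are simply skipped.

```python
def partition_parameters(neighbor_sets, parameters):
    """Greedy single sweep over the parameters, in the spirit of Algorithm 2.

    neighbor_sets[i] is N(U_i); `parameters` iterates over V.
    Returns (V_parts, cost), where cost[i] is the communication cost of
    machine i as in objective (7).
    """
    k = len(neighbor_sets)
    V_parts = [set() for _ in range(k)]
    cost = [len(N_i) for N_i in neighbor_sets]
    for j in parameters:
        # Only servers whose local worker needs j are candidates.
        users = [i for i in range(k) if j in neighbor_sets[i]]
        if not users:
            continue  # assumption: parameters no worker needs are skipped
        xi = min(users, key=lambda i: cost[i])
        V_parts[xi].add(j)
        # One local access replaces a remote one; every other user of j
        # now sends a request to server xi.
        cost[xi] += -1 + (len(users) - 1)
    return V_parts, cost


neighbor_sets = [{0, 1}, {1, 2, 3}, {3, 4}]
V_parts, cost = partition_parameters(neighbor_sets, range(5))
print(V_parts, cost)
```

Because each parameter is handled independently given the current costs, the loop can consume the parameters as a stream, which matches the remark that the full neighbor sets need not be held in memory at once.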


4. EFFICIENT IMPLEMENTATION 

The time complexity of Algorithm 2 is $O(k(|U| + |V|))$. For Algorithm 1, however, it could be $O(k|U|^2)$, which is infeasible in practice. We now discuss how to implement Algorithm 1 efficiently. We first present how to find the optimal $T^*$ and sample $R$. Then we address the parallel implementation with the parameter server, and finally describe neighbor set initialization to improve the partition quality.

4.1 Finding T* efficiently 

The most expensive operation in the inner loop of Algorithm 1 is step 10, determining which vertices, $T^*$, to add to a partition. Submodular minimization problems incur $O(n^6)$ time [23]. Given that this step is invoked frequently and the problem is large, this strategy is impractical. A key approximation that Parsa makes is to add only a single vertex at a time instead of a set of vertices: given a vertex set $R$ and partition $i$, it finds the vertex $u^*$ that minimizes

$$u^* = \operatorname*{argmin}_{u \in R} \; g_i(u) := |\mathcal{N}(\{u\} \cup U_i)| - \alpha |\{u\} \cup U_i|. \qquad (9)$$

An additional advantage of this approximation is that we are now solving exactly the CPU load balancing problem (4). Since we only assign one vertex at a time to the smallest partition, we obtain perfect balancing.

Figure 5: Vertex costs are stored in an array. The $i$-th entry is used for vertex $u_i$, where assigned vertices are marked in gray. Head pointers and a doubly-linked list afford faster access.

Even though this approximation improves the performance, a naive way to evaluate (9) is to compute all $g_i(u)$ in each iteration in order to find the $u^*$ with the minimal value. If the size of $R$ is a constant fraction of the entire graph, this leads to an undesirable time complexity of $O(|U| |E|)$, which remains impractical for graphs with billions of vertices and edges.

We accelerate the computation as follows: we store all vertex costs to avoid re-computing them, and we create a data structure to locate the lowest-cost vertex efficiently.

Storing vertex costs. If we subtract the constant $|\mathcal{N}(U_i)| - \alpha(|U_i| + 1)$ from $g_i(u)$, we obtain the vertex cost

$$\text{cost}_i(u) := |\mathcal{N}(U_i \cup \{u\})| - |\mathcal{N}(U_i)|. \qquad (10)$$

This is the number of new vertices that would be added to the neighbor set of partition $i$ due to adding vertex $u$ to $i$.

When adding $u$ to a partition $i$, only the costs of a few vertices will change. Denote by

$$\Delta_i := \{v \in \mathcal{N}(u) : v \notin \mathcal{N}(U_i)\}$$

the set of new vertices that will be added to the neighbor set of $U_i$ when assigning $u$ to partition $i$. Only vertices in $U$ connected to vertices in $\Delta_i$ will have their costs affected, and these costs will only be reduced and never increased. Due to the sparsity of the graph, this is often a small subset of the total vertices. Hence the overhead of updating the vertex costs is much smaller than that of re-computing them repeatedly.

Fast vertex cost lookup. We build an efficient data structure to store the vertex costs, which is illustrated in Figure 5. For each partition $i$, we use an array $A_i$ to store the $j$-th vertex cost, $\text{cost}_i(u_j)$, in the $j$-th entry, denoted by $A_i(u_j)$. We then impose a doubly-linked list on top of this array in increasing order to rapidly locate the lowest-cost vertex. When a vertex cost is modified (always reduced), we update the doubly-linked list to preserve the order.

Note that most large-scale graphs have a power-law degree distribution. Therefore a large portion of vertex costs will be small integers, which are always less than or equal to the vertex degrees. We store a small array of “head” pointers to the locations in the list where the cost jumps to $0, 1, 2, \ldots, \theta$. The pointers accelerate locating elements in the list when updating. In practice, we found that $\theta = 1000$ covers over 99% of vertex costs.
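
The idea behind this lookup structure can be illustrated with a bucket queue, in which bucket $c$ plays the role of the list segment that a head pointer for cost $c$ points to. The class below is a simplified stand-in, not the array plus doubly-linked list described above; the class and method names are hypothetical, and it relies on costs being small integers that only ever decrease.

```python
class BucketQueue:
    """Min-queue for small integer costs that only ever decrease."""

    def __init__(self, costs):
        self.cost = dict(costs)
        self.buckets = {}
        for vertex, c in self.cost.items():
            self.buckets.setdefault(c, set()).add(vertex)
        # Invariant: min_cost never exceeds the cost of any stored vertex.
        self.min_cost = min(self.buckets) if self.buckets else 0

    def decrease(self, vertex, amount=1):
        """Lower the cost of `vertex` and move it to the matching bucket."""
        old = self.cost[vertex]
        new = old - amount
        self.buckets[old].discard(vertex)
        self.buckets.setdefault(new, set()).add(vertex)
        self.cost[vertex] = new
        self.min_cost = min(self.min_cost, new)

    def remove(self, vertex):
        """Delete a vertex, e.g. after it was assigned to some partition."""
        self.buckets[self.cost.pop(vertex)].discard(vertex)

    def pop_min(self):
        """Remove and return a vertex with the lowest cost."""
        if not self.cost:
            raise KeyError("pop from an empty BucketQueue")
        # Skip buckets that have been emptied since the last update.
        while not self.buckets.get(self.min_cost):
            self.min_cost += 1
        vertex = self.buckets[self.min_cost].pop()
        del self.cost[vertex]
        return vertex


queue = BucketQueue({"a": 3, "b": 1, "c": 2})
queue.decrease("a", 3)
first = queue.pop_min()
second = queue.pop_min()
queue.remove("c")
print(first, second, len(queue.cost))
```

Decreasing a cost and removing a vertex are constant-time set operations here, and finding the minimum only scans upward over empty buckets, which mirrors the constant-time behavior in most cases that the head pointers are meant to provide.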























The algorithm is illustrated in Algorithm 3. The inputs are a bipartite graph $G = (U, V, E)$ and the number of partitions $k$, together with $k$ sets $S_i \subseteq V$, each of which is the union of the neighbor sets of the vertices that have previously been assigned to partition $i$. The outputs are the $k$ partitions $U_i$ of $U$ and the updated $S_i$ with the neighbor set of $U_i$ included. Here we assume $R = U$; the sampling strategy of $R$ is addressed in the next section.

Runtime. The initial $A_i(u)$ can be computed in $O(|E|)$ time and then be ordered in $O(|U|)$ time by counting sort, as they are integers upper bounded by the maximal vertex degree. The most expensive part of Algorithm 3 is updating $A_i$ in step 13. This step is evaluated at most $k|E|$ times because, for each partition, a vertex $v \in V$ together with its neighbors is accessed at most once.

In most cases, the time complexity of updating the doubly-linked lists is $O(1)$. The cost to access the $j$-th vertex is $O(1)$ due to the sequential storage in an array. Finding a vertex with the minimal value or removing a vertex from the list also takes $O(1)$ time because of the double links. Keeping the list ordered after decreasing a vertex cost by 1 is $O(1)$ in most cases ($O(|U|)$ in the worst case), as discussed above, by using the cached head pointers.

The average time complexity of Algorithm 3 is then $O(k|E|)$, much faster than the naive implementation and orders of magnitude better than Algorithm 1.
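
The bookkeeping of Algorithm 3 can be sketched as follows. This is a simplified illustration under stated assumptions: the function name is hypothetical, the lowest-cost vertex is found by a linear scan instead of the linked-list structure (so the sketch does not achieve the $O(k|E|)$ bound), and ties are broken by partition size and vertex id, a choice made here only to keep the result deterministic.

```python
def partition_examples(N_u, N_v, k, S=None):
    """Greedy assignment of examples following the steps of Algorithm 3.

    N_u[u] is the neighbor set of example u, N_v[v] the set of examples
    touching parameter v, and S the optional initial neighbor sets.
    Returns (U_parts, S) with updated copies of the neighbor sets.
    """
    S = [set(s) for s in S] if S is not None else [set() for _ in range(k)]
    U_parts = [set() for _ in range(k)]
    unassigned = set(N_u)
    # A[i][u] = |N(u) \ S_i|, the cost of adding u to partition i.
    A = [{u: len(N_u[u] - S[i]) for u in unassigned} for i in range(k)]
    while unassigned:
        # Smallest neighbor set first; ties go to the partition with
        # fewer examples (an assumption of this sketch).
        i = min(range(k), key=lambda j: (len(S[j]), len(U_parts[j])))
        u_star = min(unassigned, key=lambda u: (A[i][u], u))
        U_parts[i].add(u_star)
        unassigned.remove(u_star)
        for j in range(k):
            del A[j][u_star]
        for v in N_u[u_star] - S[i]:
            S[i].add(v)
            # Every unassigned neighbor of v becomes one unit cheaper.
            for u in N_v[v]:
                if u in unassigned:
                    A[i][u] -= 1
    return U_parts, S


N_u = {0: {0, 1}, 1: {2, 3}, 2: {0, 1}, 3: {2, 3}}
N_v = {0: {0, 2}, 1: {0, 2}, 2: {1, 3}, 3: {1, 3}}
U_parts, S = partition_examples(N_u, N_v, k=2)
print(U_parts, S)
```

On this toy graph the examples that share parameters end up in the same partition, so each worker only needs two of the four parameters.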

4.2 Division into Subgraphs 

One goal of the sampling strategy used in Algorithm 1 is to keep the partitions of $U$ balanced, because it limits the number of vertices assigned to a partition at a time. The additional constraint $|T| = 1$ introduced in the previous section ensures that only a single vertex is assigned each time, which already addresses balancing. Consequently we would like to sample as many vertices as possible to enlarge the search range for the optimal $u^*$ and thereby improve the partition quality. Sampling remains appealing since it offers a trade-off between computational efficiency and partition quality.

Parsa first randomly divides $U$ into $b$ blocks. It next constructs the corresponding $b$ subgraphs by adding the neighbor vertices from $V$ and the corresponding edges, and then partitions these subgraphs sequentially using Algorithm 3. In other words, denote by $\{G_j\}_{j=1}^{b}$ the $b$ subgraphs and by $\{S_i\}_{i=1}^{k}$ the initialized neighbor sets, for instance $S_i = \emptyset$ for all $i$. At iteration $j = 1, \ldots, b$, we feed $G_j$ and the $S_i$'s into Algorithm 3 to obtain the partitions $\bigcup_{i=1}^{k} U_{i,j}$ and the updated $S_i$'s, which contain the previous partition information. Then we take the union of the results on each subgraph to obtain the final partitions of $U$ by $U_i = \bigcup_{j=1}^{b} U_{i,j}$ for $i = 1, \ldots, k$.

Compared to the scheme described in Section 3.1, which samples a new subgraph ($R$) for each single vertex assignment, Parsa fixes those subgraphs at the beginning. This sampling strategy has several advantages. First, it partitions a subgraph by directly using Algorithm 3, which takes advantage of the head pointers and linked list to improve efficiency. Next, it is convenient to perform both the initialization of neighbor sets and the parallelization, which will be introduced shortly, at subgraph granularity. Finally, this strategy is I/O efficient, because we must only keep the current subgraph in memory. As a result, it is possible to partition graphs much larger than the physical memory.

The number of subgraphs $b$ is a trade-off between partition quality and computational efficiency. In the extreme case of $b = 1$, the vertex assigned to a partition is the optimal one among all unassigned vertices. This is, however, the most time-consuming setting. In contrast, though the time complexity reduces to $O(|E|)$ when letting $b = |U|$, we only get random partition results. Therefore, a well-chosen $b$ not only removes the graph size constraint but also balances time and quality.
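
The block-wise scheme can be sketched as a loop that carries the neighbor sets from one block to the next. The names below are hypothetical, costs are recomputed naively instead of being maintained incrementally as in Algorithm 3, and the tie-breaking rule is an assumption of this sketch.

```python
import random


def partition_block(block, N_u, S, U_parts):
    """Greedily assign one block of examples, updating S and U_parts in place."""
    remaining = set(block)
    while remaining:
        i = min(range(len(S)), key=lambda j: (len(S[j]), len(U_parts[j])))
        # Naive cost |N(u) \ S_i|; ties are broken by vertex id.
        u_star = min(remaining, key=lambda u: (len(N_u[u] - S[i]), u))
        U_parts[i].add(u_star)
        S[i] |= N_u[u_star]
        remaining.remove(u_star)


def partition_in_blocks(N_u, k, b, seed=0):
    """Randomly split U into b blocks and partition them one after another."""
    order = sorted(N_u)
    random.Random(seed).shuffle(order)
    blocks = [order[j::b] for j in range(b)]
    S = [set() for _ in range(k)]
    U_parts = [set() for _ in range(k)]
    for block in blocks:
        # Only the current block has to be resident in memory; S carries
        # the information about all previously processed blocks.
        partition_block(block, N_u, S, U_parts)
    return U_parts, S


N_u = {0: {0, 1}, 1: {2, 3}, 2: {0, 1}, 3: {2, 3}}
print(partition_in_blocks(N_u, k=2, b=1))
print(partition_in_blocks(N_u, k=2, b=2))
```

With $b = 1$ every assignment considers all unassigned examples, whereas larger $b$ restricts the search to the current block and trades quality for speed.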

4.3 Parallelize with the Parameter Server 

Although Parsa can partition very large graphs with a single process by taking advantage of sampling, parallelization is desirable because it reduces both the CPU and I/O time on each machine. Parsa parallelizes the partitioning by processing different subgraphs in parallel (on different nodes) using the shared neighbor sets.

To implement the algorithm using the parameter server, we need the following three groups of nodes:

The scheduler issues partitioning tasks to workers and monitors their progress.

Server nodes maintain the globally shared neighbor sets. They process push and pull requests from workers.

Worker nodes partition subgraphs in parallel. In each round, a worker first reads a subgraph from the (distributed) file system. It then pulls the newest neighbor sets associated with this subgraph from the servers. Then, it partitions this subgraph using Algorithm 3 and finally pushes the modified neighbor sets to the servers.

4.4 Initializing the Neighbor Sets 

The neighbor sets play a role similar to that of cluster centers in clustering methods: both affect the assignment of vertices. Well-initialized neighbor sets can improve the partition results. Initialization with empty sets, which prefers assigning vertices with small degrees first, however, often helps little or even degrades the resulting assignment. Parsa uses several initialization strategies to improve the results:

Individual initialization. Given a graph that has been divided into $b$ subgraphs, we can run $a + b$ iterations, where the results of the first $a$ iterations are used for initialization. In other words, before processing the $(j + 1)$-th subgraph, $j < a + 1$, we reset the neighbor sets by $S_i = \mathcal{N}(U_{i,j})$, where $\{U_{i,j}\}_{i=1}^{k}$ are the partitions of the $j$-th subgraph. The old results are dropped because otherwise a vertex $u$ would be assigned to its old partition $i$ again, as $S_i$ contains the neighbors of $u$ and the cost $|\mathcal{N}(u) \setminus S_i|$ would then be 0.

Global initialization. In parallel partitioning, before starting all workers, we first sample a small part of the graph and let one worker partition this small subgraph. We then use the resulting neighbor sets as an initialization for all workers.

Incremental partitioning. In this setting, data arrives incrementally and we want to partition the new data efficiently. Since we already have the partitioning results for the old data, we can use them to initialize the neighbor sets.

4.5 Putting it all together

Algorithm 4 shows Parsa, which partitions U into k parts
in parallel. Then we can assign V using Algorithm 2 if
necessary. The initial neighbor sets can be obtained from
global initialization or incremental partitioning, discussed in
the previous section. There are several details worth noting.
First, while communication in the parameter server is
asynchronous, Parsa imposes a maximal allowed delay τ to
control the data consistency. Second, a worker may push only
the changes of the neighbor sets to the servers to save
communication traffic. Finally, a worker may start a
separate data pre-fetching thread to run steps 3, 4 and 5 to
improve efficiency.


Algorithm 4 Parsa: parallel submodular approximation

Input: Graph G, initial neighbor sets {S_i}_{i=1}^k, #partitions k,
       max delay τ, initialization from a, #subgraphs b.
Output: partitions U = ∪_{i=1}^k U_i

Scheduler:
 1: divide G into b subgraphs
 2: ask all workers to partition with (a, τ, true)
 3: ask all workers to partition with (b, τ, false)

Server:
 1: start with a part of {S_i}_{i=1}^k
 2: if receiving a pull request then
 3:     reply with the requested neighbor sets {S_i}_{i=1}^k
 4: end if
 5: if receiving a push request containing {S_i^new}_{i=1}^k then
 6:     if initializing then
 7:         S_i ← S_i^new for i = 1,...,k
 8:     else
 9:         S_i ← S_i ∪ S_i^new for i = 1,...,k
10:     end if
11: end if

Worker:
 1: receive hyper-parameters (T, τ, initializing)
 2: for t = 1,...,T do
 3:     load a subgraph G(U, V, E)
 4:     wait until all pushes before time t − τ finished
 5:     pull the part of the neighbor sets {S_i}_{i=1}^k that is contained
        in V from the servers
 6:     get partitions {U_i^new}_{i=1}^k and updated neighbor sets
        {S_i^new}_{i=1}^k using Algorithm 3
 7:     if initializing = false and t > 1 then
 8:         S_i^new ← S_i^new \ S_i for all i = 1,...,k
 9:     end if
10:     push {S_i^new}_{i=1}^k to servers
11:     if not initializing then U_i ← U_i ∪ U_i^new for all i
12: end for
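
The server's push/pull handling and the worker's "push only the changes" step can be sketched in a few lines of Python. This is a single-process toy under stated assumptions: the class `NeighborSetServer`, the helper `delta`, and the vertex names are hypothetical. It does not model networking, the delay τ, or the `t > 1` condition of the worker loop.

```python
class NeighborSetServer:
    """Toy single-process stand-in for the servers holding S_1..S_k."""

    def __init__(self, k):
        self.S = [set() for _ in range(k)]

    def pull(self, V):
        # Reply only with the part of each neighbor set contained in V.
        return [S_i & V for S_i in self.S]

    def push(self, S_new, initializing):
        for i, S_i_new in enumerate(S_new):
            if initializing:
                self.S[i] = set(S_i_new)   # overwrite: old results are dropped
            else:
                self.S[i] |= S_i_new       # merge the pushed vertices


def delta(S_new, S_pulled):
    """Worker side: keep only the vertices the server has not seen yet."""
    return [new - old for new, old in zip(S_new, S_pulled)]


server = NeighborSetServer(k=2)
server.push([{"v1", "v2"}, {"v3"}], initializing=True)

V = {"v1", "v2", "v3", "v4"}               # V-side vertices of the loaded subgraph
pulled = server.pull(V)
updated = [pulled[0] | {"v4"}, pulled[1]]  # pretend the partitioner added v4 to S_1
changes = delta(updated, pulled)
server.push(changes, initializing=False)
```

Only the single new vertex travels to the server in the second push, which is the traffic saving the second detail above refers to.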

5. EXPERIMENTS 

We chose 7 datasets of varying type and scale, as summarized
in Table 1. The first three are text datasets^1; live-journal
and orkut are social networks^2; and the last two are
click-through rate datasets from a large Internet company.
The numbers of vertices and edges range from 10^4 to 10^10.

5.1 Setup 

We implemented Parsa in the parameter server [18]; the 
source is available at https://github.com/mli/parsa. We 

^1 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
^2 http://snap.stanford.edu/data/


name           |U|     |V|     |E|     type
rcv1           20K     47K     1M      bipartite
news20         20K     1M      9M      bipartite
KDDa           8M      20M     305M    bipartite
live-journal   5M      5M      69M     directed
orkut          3M      3M      113M    undirected
CTRa           1M      4M      120M    bipartite
CTRb           100M    3B      10B     bipartite

Table 1: A collection of real datasets. 

compared Parsa with the popular vertex-cut graph partitioning
toolboxes Zoltan^3 and PaToH^4, which can also take bipartite
graphs as input. We also used the well-known graph partitioning
package METIS^5 and the greedy algorithm adopted by
PowerGraph^6, though these handle only normal graphs. All
algorithms are implemented in C/C++.

We report both runtime and partition results. The default
measurement of the latter is the maximal individual traffic
volume. We measured the improvement against random
partitioning by (random - proposed)/proposed × 100%, where
a 100% improvement means that the traffic or memory footprint
is 50% of that achieved by random partitioning.
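
As a quick check of this definition, the short Python sketch below computes the improvement and inverts it. The function names and input values are illustrative assumptions, not measurements from the experiments.

```python
def improvement_pct(random_cost, proposed_cost):
    """Improvement over random partitioning, in percent."""
    return (random_cost - proposed_cost) / proposed_cost * 100.0


def fraction_of_random(improvement):
    """Invert the definition: proposed cost as a fraction of the random cost."""
    return 1.0 / (1.0 + improvement / 100.0)


halved = improvement_pct(random_cost=2.0, proposed_cost=1.0)   # cost cut in half
quarter = fraction_of_random(300.0)                            # a 300% improvement
```

Because the denominator is the proposed cost rather than the random cost, the measure is unbounded: halving the cost gives 100%, and a 300% improvement corresponds to one quarter of the random cost.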

The default number of partitions was set to 16. As Parsa
is a randomized algorithm, we recorded the average results
over 10 trials. Single-thread experiments used a desktop with
an Intel i7 3.4GHz CPU, while the parallel experiments used
a university cluster with 16 machines, each with an Intel
Xeon 2.4GHz CPU and 1 Gigabit Ethernet.

5.2 Comparison to other Methods 

Table 2 shows the comparison results on different datasets.
We recorded the CPU time for running each algorithm, excluding
the time for loading the data, because loading performance
varies with the data format each algorithm uses. The improvements
are measured on maximal individual memory footprint
and traffic volume, together with total traffic volume,
which is the objective for both PaToH and Zoltan. Since
neither METIS nor PowerGraph handles general sparse matrices,
only their results on social networks are reported. The
number of partitions is 16, and the parameters of Parsa are
fixed at a = b = 16. See Figure 6 for improvements on
maximal individual traffic volume and runtimes.

As can be seen, Parsa is not only 20x faster than PaToH
and Zoltan, but also produces more stable partition results,
especially in reducing the memory footprint. METIS outperforms
Parsa on one of the two social networks but consumes
twice as much CPU time. PowerGraph is the fastest
but at the cost of worse partition quality. Parsa produces
similar results under both measurements, maximal individual
traffic volume and total traffic volume.

As the number of partitions increases, the recursive-bisection-based
algorithms (METIS, PaToH, and Zoltan) retain their
runtimes, but their partition quality degrades, as shown in
Figure 7. In contrast, Parsa and PowerGraph compute k-partitions
directly. Their runtimes increase linearly with k,
but their partition quality actually improves.

5.3 Number of subgraphs and initialization 

^3 http://www.cs.sandia.gov/Zoltan/

^4 http://bmi.osu.edu/~umit/software.html

^5 http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

^6 http://graphlab.org/downloads/
123 389

23
214 267

21 

23 

187 155 

1 

CTRa
446 970

571 

70 

1052 1211 

551 

n 

922 913 

18 

KDDa 






54 

905 1102 

1401 

4 

238 313 

2409 

120 

1973 1978 

89 

live-journal 

61 84 

89 

9 

185 231 279 

65 

103 

152 160 

3.5h 

50 

84 3M 

1072 

142 

216 214 

37 

orkut 

55 74 

78 

12 

56 74 103 

104 

87 

145 150 

5.5h 

49 

170 180 

1413 

105 

177 121 

39 


Table 2: Improvements (%) compared to random partitioning on the maximal individual memory footprint
M_max, maximal individual traffic volume T_max, and total traffic volume T_sum, together with running times
(in sec) on 16 partitions. The best results are colored red and the second best green.

Figure 7: Improvement over random partitioning when changing the number of partitions k. Top: CTRa;
Bottom: live-journal. Note that although both Parsa and PowerGraph take more time as k increases, they
also improve the partition quality.


We examine two important optimizations in Parsa: the
size of the subgraphs and the neighbor set initialization. We first
consider the single-thread case, which starts with empty
neighbor sets. We divide the data into different numbers
of subgraphs b and also vary the number of subgraphs
a used for initialization. The results on the representative text
dataset CTRa and the social network live-journal are shown in
Figure 8.

The x-axis plots a/b × 100%, which is the percentage of data
used for constructing the initialization. It is clear that using
more data for initialization improves partition quality.
Although the improvement is not significant for the single-subgraph
case (b = 1), because partitioning the same subgraph
several times changes the results little, there is a stable 20%
improvement over no initialization when at least 100% of the data
is used for b > 1.

Without initialization, using small subgraphs has a positive
effect on partition results for live-journal, but not for
CTRa. The reason, as mentioned in Section 4.2, is that
Parsa prefers to assign vertices with small degrees first when
starting with empty neighbor sets. Those vertices offer little
or no benefit for the subsequent assignment. Live-journal
has many more sparse vertices than CTRa due to the power
law distribution, and partitioning small subgraphs reduces
the number of sparse vertices entering partitions too early.

Figure 8: Varying the number of subgraphs and percent of data used for initialization for single thread
partitioning. Top: CTRa; Bottom: live-journal.


Initialization solves the previous problem by dropping early
partition results and resetting the corresponding neighbor sets.
With a > 16 in Figure 8, small subgraphs improve the partition
results on both CTRa and live-journal. This occurs
because, when the same percentage of data is used for initialization,
neighbor sets with small b are reset more often.

Figure 8 also shows the runtime. Splitting into more
blocks (larger b) narrows the search range for adding vertices,
which reduces the cost of operating the doubly-linked
list, boosting speed. The runtime increases linearly as we
use and discard more samples for initialization, but the partition
quality benefits of doing so appear worthwhile up to
performing two passes (100% samples).
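
The relation between a, b, the x-axis value, and the number of passes over the data can be written out as a small Python sketch. It assumes that each of the a + b iterations processes one subgraph holding 1/b of the data; the function names are illustrative.

```python
def init_percent(a, b):
    """Percentage of the data used for initialization (the x-axis value)."""
    return a / b * 100.0


def data_passes(a, b):
    """Total passes over the data when each iteration handles 1/b of it."""
    return (a + b) / b


no_init = data_passes(a=0, b=16)     # one pass, nothing discarded
full_init = data_passes(a=16, b=16)  # 100% of the data used for initialization
```

Under this assumption the setting a = b corresponds to two passes, which is why the runtime roughly doubles at the 100% mark.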

Next we consider the parallel case with non-empty starting
neighbor sets. We use 4 workers to partition a subset of
CTRb containing 1 billion edges and use one worker to
partition an even smaller subgraph to obtain the starting
neighbor sets. The results of varying the size of this
subgraph are shown in Figure 9.

As can be seen, the partition quality is significantly improved
even when only 0.1% of the data is used for the global
initialization. In addition, although this initialization takes
extra time, the total running time is minimized when we
use initialization. A good initialization of the neighbor sets
reduces the cost of operating the doubly-linked lists, saving
time.

5.4 Scalability 

We test the scalability of Parsa on CTRb with 10 billion
edges by increasing the number of machines. We run 4 workers
and 4 servers on each machine with infinite maximum
delay. The results are shown in Figure 10. As can be seen,
the speedup is linear in the number of machines and close
to the ideal case. In particular, we obtained a 13.7x speedup
by increasing the number of machines from 1 to 16.
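
To put the reported speedup in perspective, the following Python sketch derives the parallel efficiency, that is, the speedup divided by the number of machines. This is a standard derived quantity and is not reported in the experiments themselves; the function name is illustrative.

```python
def parallel_efficiency(speedup, machines):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup / machines


efficiency = parallel_efficiency(speedup=13.7, machines=16)
```

A 13.7x speedup on 16 machines corresponds to an efficiency of about 86% of the ideal linear scaling.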

Figure 6: Visualization of comparisons from Table 2.
Top: text datasets; Bottom: social networks. Note
that on text datasets (naturally bipartite graphs)
Parsa is both orders of magnitude faster and yields
better results. On social networks PowerGraph is
faster, which is to be expected since it uses very
simple and fast partitioning.

The main reason that Parsa scales well is the eventual
consistency model (τ = ∞). In this model, there is no
global barrier between workers, and a worker does not even
wait for its previous results to be pushed successfully. Therefore,
workers fully utilize the computational resources and
network bandwidth, and waste no time waiting for data
synchronization.
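
The bounded-delay rule from the worker loop of Algorithm 4 ("wait until all pushes before time t − τ finished") can be sketched as a simple predicate. This is a minimal illustration: the function `can_start` and the representation of finished pushes as a set of time steps are assumptions made here, not the parameter server's API.

```python
import math


def can_start(t, tau, finished):
    """True if a worker may begin step t under a maximal delay tau.

    `finished` is the set of earlier time steps whose pushes have completed.
    """
    if math.isinf(tau):
        return True                     # eventual consistency: never wait
    # Every push issued before time t - tau must have completed.
    return all(s in finished for s in range(1, t - tau))


finished = {1, 2, 3}                    # the push from step 4 is still in flight
sequential = can_start(5, 0, finished)  # tau = 0: needs pushes 1..4
bounded = can_start(5, 1, finished)     # tau = 1: needs pushes 1..3
eventual = can_start(5, math.inf, set())
```

With τ = 0 the worker blocks until the push from step 4 arrives, with τ = 1 it may proceed, and with τ = ∞ it never blocks, which is the setting used in the scalability experiment.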

This consistency model, however, potentially leads to
inconsistency of the neighbor sets between workers. Nevertheless,
we found that Parsa is robust to this kind of inconsistency.
In our experiment, increasing the number of machines from
1 to 16 (4 to 64 in terms of workers) decreases the
quality of the partition result by at most 5%. We believe
the reason is twofold. First, the starting neighbor sets obtained
on a small subgraph give all workers a consistent
initialization, which may contain the membership of most
head (large degree) vertices in V. Second, the modifications
of the neighbor sets that each worker contributes after
partitioning a subgraph are therefore mainly about tail (small
degree) vertices in V. Due to the extreme sparsity of the
tail vertices, the conflicts among workers could be small and
therefore affect the results little.

Figure 9: Varying the percentage of data used
for globally initializing the neighbor sets with 4
workers.

5.5 Accelerating Distributed Inference 

Finally we examine how much Parsa can accelerate distributed
machine learning applications through better data and
parameter placement. We consider ℓ1-regularized logistic
regression, which is one of the most widely used machine
learning algorithms for large-scale text datasets. We choose a
state-of-the-art distributed inference algorithm, DBPG [19],
to solve this problem. It is based on the block proximal
gradient method and uses several techniques to improve
efficiency: it supports a maximal τ-delay consistency model
similar to Parsa, and uses several user-defined filters, such
as key caching, value compression, and an algorithm-specific
KKT filter, to further reduce communication cost. This algorithm
has been implemented in the parameter server [18]
and is well optimized. It can use 1,000 machines to train
ℓ1-regularized logistic regression on 500 terabytes of data within
an hour [19].

































Figure 10: The scalability of Parsa on dataset CTRb. 

We run DBPG on CTRb using 16 machines as the baseline.
We enabled all optimization options described in [19]. Then
we partition CTRb into 16 parts with Parsa and run DBPG
again. The runtime is shown in Table 3. With random
partitioning, DBPG stops after passing over the data 45 times and
uses 1.43 hours. On the other hand, Parsa uses 4 minutes to
partition the data and then accelerates DBPG to 0.84 hours.
As a result, Parsa reduces the total time from 1.43 hours
to 0.91 hours, a 1.6x speedup.
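
The arithmetic behind the end-to-end speedup can be reproduced from the Table 3 entries. The Python sketch below uses only the reported figures; the variable names are illustrative.

```python
# Hours reported in Table 3.
baseline_h = 1.43      # random partitioning: no partition step, inference only
partition_h = 0.07     # Parsa partitioning (about 4 minutes)
inference_h = 0.84     # DBPG inference on the Parsa partition

total_h = partition_h + inference_h
end_to_end_speedup = baseline_h / total_h
inference_only_speedup = baseline_h / inference_h
```

The end-to-end ratio is about 1.57, which rounds to the reported 1.6x. Counting inference alone, the ratio is about 1.7x, so the partitioning time costs only a small part of the gain.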

The reason Parsa accelerates DBPG is shown clearly in
Table 4. With random partitioning, only 6% of the network traffic
between servers and workers happens locally. Even though
prior work reported very low communication cost for
DBPG [19], we observe that a significant amount of time was
spent on data synchronization. The reason is twofold. First,
[19] pre-processed the data to remove tail features (tail vertices
in V) before training, whereas we fed the raw data into the
algorithm and let the ℓ1-regularizer do the feature selection
automatically, which often yields a better machine learning
model but induces more network traffic. Second, the network
bandwidth of the university cluster we used is 20 times
lower than that of the industrial data center used by [19]. Therefore,
the communication cost cannot be ignored in our experiment.
However, after the partitioning, the inter-machine
communication decreases from 4.2TB to 0.3TB. Furthermore,
the ratio of inner-machine traffic increases from 6%
to 92%. In total, inter-machine communication is decreased
by more than 90%, which significantly speeds up inference.
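
The percentages quoted above follow directly from the Table 4 entries. The Python sketch below recomputes them; it derives each row total as the sum of the two reported columns, and the names are illustrative.

```python
# Terabytes sent during inference, as reported in Table 4.
traffic_tb = {
    "random": {"inner": 0.27, "inter": 4.23},
    "Parsa": {"inner": 3.68, "inter": 0.32},
}


def local_share(row):
    """Fraction of all traffic that stays inside a machine."""
    return row["inner"] / (row["inner"] + row["inter"])


random_local_pct = round(local_share(traffic_tb["random"]) * 100)
parsa_local_pct = round(local_share(traffic_tb["Parsa"]) * 100)
inter_reduction = 1 - traffic_tb["Parsa"]["inter"] / traffic_tb["random"]["inter"]
```

The local share rises from 6% to 92%, and inter-machine traffic drops by roughly 92%, consistent with the statement that more than 90% of it is eliminated.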

6. CONCLUSION

This paper presented a new parallel vertex-cut graph partitioning
algorithm, Parsa, to solve the data and parameter
placement problem. Our contributions are the following:

• We give theoretical analysis and approximation guarantees
for both decomposition stages of what is generally
an NP-hard problem.

• We show that the algorithm can be implemented very
efficiently in O(k|E|) time by judicious use of a doubly-linked
list.


method    partition    inference    total
random    0h           1.43h        1.43h
Parsa     0.07h        0.84h        0.91h


Table 3: Time for ℓ1-regularized logistic regression
on CTRb on 16 machines requiring 45 data passes.


method    inner-machine    inter-machine    total
random    0.27             4.23             4.51
Parsa     3.68             0.32             4.00


Table 4: Total data (TB) sent during inference.

• We provide techniques such as sampling, initialization,
and parallelization to improve the speed and
partition quality.

• Experiments show that Parsa works well in practice,
beating (or matching) all competing algorithms in both
memory footprint and communication cost while also
offering very fast runtime.

• We used Parsa to accelerate a state-of-the-art distributed
solver for ℓ1-regularized logistic regression implemented
in the parameter server. We observed a 1.6x speedup on 16
machines with a dataset containing 10 billion nonzero
entries.

In summary, Parsa is a fast, relatively simple, highly scalable
and well-performing algorithm.